This project aims to explore which chemical properties influence the quality of white wines. The data set used for this project contains 4898 instances of different variants of the Portuguese “Vinho Verde” wine and can be downloaded here.
The data set includes 12 variables: 11 quantify the chemical properties of each wine, and one contains a rating of the quality of the wine between 0 (very bad) and 10 (very excellent). This quality score is the median evaluation of at least 3 wine experts.
For more information on the data set, read this text file.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Input variables (based on physicochemical tests):
1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4. residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5. chlorides: the amount of salt in the wine
6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12. quality (score between 0 and 10)
Information on the variables was found here.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
The data set consists of 4898 observations and 12 variables. All variables are numerical, and quality specifically is an integer. The variables quality, free.sulfur.dioxide and total.sulfur.dioxide appear to be discrete and the rest appear to be continuous. This makes sense for quality, which is a rating scale, but free.sulfur.dioxide and total.sulfur.dioxide seem like they should be continuous like the rest.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
The summary shows the distribution of the data within each variable. The 1st to 3rd quartiles mark the range within which the middle half of the data falls. A min value much lower than the 1st quartile, or a max value much higher than the 3rd quartile, indicates that there are outliers in the data. citric.acid seems to have values of zero.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
This table of citric.acid shows that there are 19 observations with a value of 0 for this variable. Having a value of 0 might be correct or it might be due to a data entry or rounding error.
To see whether the zero values for citric.acid affect the quality variable, these histograms compare the distribution of quality with and without them. The histogram on the left includes all observations and the histogram on the right includes only observations where citric.acid is greater than 0. The distribution of quality does not seem to be affected by these zero values.
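The zero-value check described above can be sketched in a few lines of base R. This is only an illustration: the data frame `wine` and its values are invented stand-ins, not the real data set, and the original analysis may have been written differently.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine <- data.frame(
  citric.acid = c(0, 0.36, 0.34, 0, 0.40, 0.32),
  quality     = c(5L, 6L, 6L, 4L, 7L, 6L)
)

# Frequency table of citric.acid; the entry named "0" counts the zeros.
citric_counts <- table(wine$citric.acid)
print(citric_counts)

# Subset that keeps only observations with citric.acid greater than 0.
wine_nonzero <- subset(wine, citric.acid > 0)

# Compare the distribution of quality with and without the zero values.
print(table(wine$quality))
print(table(wine_nonzero$quality))
```

Comparing the two quality tables (or histograms of them) side by side shows whether dropping the zeros shifts the distribution.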
A boxplot can help to illustrate the variability in the data. Only the variable alcohol seems to have no outliers at all. This boxplot shape could indicate that the data is distributed into two “lumps” with very few data points in between.
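For reference, base R boxplots by default draw a point as an outlier when it lies more than 1.5 times the interquartile range beyond the hinges. The sketch below uses `boxplot.stats()` on an invented vector to show which points would be flagged; the plots in this report may have used other settings.

```r
# Invented values with one extreme observation, for illustration only.
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() applies the default rule (coef = 1.5).
stats <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(stats$stats)

# Values that a boxplot would draw individually as outliers.
print(stats$out)
```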
Another way to illustrate the distribution of the data is to create a histogram. Here we can see that the data in alcohol is not distributed in two “lumps”; there are many data points in the middle. Most variables seem to have a somewhat normal distribution or a slightly positive skew (long right tail). The variable quality is not continuous, so the data points are centered on whole numbers and the histogram almost looks like a bar chart. The outliers in some charts make it harder to see the distribution.
To better see how the variables with outliers are distributed, we can transform the data, for example by log10. The variables chlorides, free.sulfur.dioxide and sulphates appear to have a log-normal distribution, while residual.sugar seems to have a bimodal distribution and alcohol a multimodal distribution. For density it is still hard to see the distribution because of the outliers.
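A minimal sketch of the log10 transformation, using base R histograms and simulated data. The vector `chlorides` here is generated with `rlnorm()` purely as a hypothetical stand-in for a right-skewed variable; it is not the real column.

```r
set.seed(42)  # reproducible toy data

# Hypothetical stand-in for chlorides: positive, right-skewed values.
chlorides <- rlnorm(1000, meanlog = log(0.045), sdlog = 0.4)

# Raw scale: the long right tail pushes most of the bars to the left.
hist(chlorides, breaks = 30,
     main = "chlorides", xlab = "chlorides")

# log10 scale: a log-normal variable looks roughly bell-shaped here.
hist(log10(chlorides), breaks = 30,
     main = "log10(chlorides)", xlab = "log10(chlorides)")
```

Note that log10 of zero is not finite, so a variable containing zeros (such as citric.acid) would need those values handled before transforming.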
By limiting the range included in the histogram for density, it is easier to see the shape of the data, which seems to have a somewhat normal distribution.
quality is the main variable of interest that this project aims to investigate. The shape of the data in quality seems to follow a normal distribution curve.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
No wine in the data set has a quality score lower than 3 or higher than 9.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
This table of the quality variable shows that most wines have a quality score of 5 or 6. Very few wines have a quality score of 3 or 9.
##
## 3-4 5 6 7 8-9
## 183 1457 2198 880 180
Quality scores 3 and 9 had very few observations in the original quality variable, and I don’t think it’s reasonable to compare 5 or 20 data points to the hundreds or thousands in the other quality groups. The table above shows the counts for a new quality variable, quality_cat, where the quality scores 3 and 4 have been combined into one group and the scores 8 and 9 into another.
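One way to build such a grouped variable is with `cut()`. The sketch below is an assumption about how quality_cat could be constructed, not necessarily how it was done in this project; the `quality` vector is rebuilt from the counts in the quality table shown earlier.

```r
# Rebuild the quality scores from the counts in the quality table.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# cut() uses right-closed intervals by default:
# (2,4] -> "3-4", (4,5] -> "5", (5,6] -> "6", (6,7] -> "7", (7,9] -> "8-9".
quality_cat <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 7, 9),
  labels = c("3-4", "5", "6", "7", "8-9")
)

print(table(quality_cat))
```

The resulting counts can be checked against the grouped table above: 20 + 163 = 183 for group 3-4 and 175 + 5 = 180 for group 8-9.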
The data set contains 4898 observations of different variants of the Portuguese “Vinho Verde” with 11 variables on the chemical properties of the wine and one variable on the quality of the wine, which is the median value of the quality rating of at least 3 wine experts, with the range of 0 (very bad) to 10 (very excellent).
Other observations: Most wines have a quality score of 6, and there are no wines in the data set with a quality score lower than 3 or higher than 9. The median alcohol percent in a wine is 10.4%, and the alcohol percent range in the data set is 8-14.2%. White wine is highly acidic, with a median pH level of 3.18 in the data set.
The main feature to investigate in this project is what makes a white wine high in quality, that is, which chemical properties of a wine might affect how its quality is perceived.
According to Wine Folly, the 5 basic characteristics of wine are:
1. Sweetness
2. Acidity
3. Tannin
4. Alcohol
5. Body
A sweet wine has a higher content of residual sugar in it, and its opposite is a dry wine. However, dry wines can have a hint of sweetness to make them more full-bodied. The data set description states that wines with residual sugar of more than 45 g/l are considered sweet. I want to explore if the variable residual.sugar influences the quality score.
Wines with higher acidity feel lighter-bodied and wines that have less acidity are more rich and round. The variables fixed.acidity, volatile.acidity, citric.acid and pH should all be related to the acidity of a wine. It is especially interesting to explore how volatile acidity influences the quality score, since too high levels of it can give the wine a vinegar taste according to the data set description. Citric acid and its relationship to the quality score of a wine is also interesting because it can add freshness and flavor to a wine.
The characteristics Tannin and Body can’t really be related to any of the variables in the data set and will be hard to investigate, but the variable alcohol and its relationship to the quality scores should be explored. Wines with higher alcohol tend to taste bolder and more oily and wines with lower alcohol tend to taste lighter-bodied, according to Wine Folly.
Furthermore, the variable free.sulfur.dioxide should be explored since concentrations of more than 50 ppm in a wine become evident in the nose and taste of it, according to the data set description. The variable chlorides should also be looked into since the amount of salt in a wine most likely affects how salty it tastes, which could also impact its quality score.
I created a new variable called quality_cat where I combined the observations with quality score of 3 and 4 and 8 and 9 respectively because quality scores 3 and 9 had so few observations and would be hard to fairly compare to the other scores which had many more observations. The new variable has 5 different quality score groups: 3-4, 5, 6, 7, 8-9.
Most variables seem to have a somewhat normal distribution or a slightly positive skew (long right tail). I transformed the data by log10 to make it less skewed. The variables chlorides, free.sulfur.dioxide and sulphates appear to have log-normal distributions, while residual.sugar seems to have a bimodal distribution and alcohol a multimodal distribution.
I removed column X in the data set because this was just a generic sequence number.
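Dropping such an index column is a one-line operation in R. The sketch below reads a tiny inline CSV as a hypothetical stand-in for the real file (the actual file name and loading code are not shown in this report) and removes the column X.

```r
# Tiny inline CSV standing in for the real file; values are illustrative.
csv_text <- "X,fixed.acidity,quality
1,7.0,6
2,6.3,6
3,8.1,6"

white_wine <- read.csv(text = csv_text)

# Remove the sequence-number column; the remaining columns are unchanged.
white_wine$X <- NULL

str(white_wine)
```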
The variable citric.acid has some zero values so I created a subset excluding these to see if they affected the quality variable in some way. The zero values can be correct since the overall values for citric acid in the data are very low (0-1.66), or they could be due to a data entry or rounding error. Removing the 19 observations with zero values for citric acid didn’t seem to affect quality and I decided to keep these in the data set.
In this correlation matrix we can see how strongly two variables are linearly related.
There is a strong to weak positive relationship between residual.sugar and density (0.84), free.sulfur.dioxide and total.sulfur.dioxide (0.62), density and total.sulfur.dioxide (0.53) and residual.sugar and total.sulfur.dioxide (0.4). This means that when one of the variables gets larger, the other gets larger too.
There is a strong to weak negative relationship between alcohol and density (-0.78), alcohol and residual.sugar (-0.45), alcohol and total.sulfur.dioxide (-0.45), pH and fixed.acidity (-0.43) and alcohol and chlorides (-0.36). This is an inverse correlation, as one variable gets larger the other gets smaller.
quality has a weak positive relationship with alcohol (0.44) and a very weak negative relationship with density (-0.31). Other than that, quality doesn’t appear to be linearly related to any other variable.
The correlation coefficients between pH and free.sulfur.dioxide and between pH and total.sulfur.dioxide are both 0, so there seems to be no linear relationship between these variables.
Contrary to what one might have assumed, fixed and volatile acidity are not linearly related (-0.02).
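A correlation matrix like the one discussed here can be computed with `cor()`, which returns Pearson coefficients by default. The sketch below uses simulated columns as hypothetical stand-ins for three of the variables; the coefficients it prints are properties of the toy data only, not of the wine data.

```r
set.seed(1)  # reproducible toy data
n <- 200

# Simulated stand-ins: density rises with sugar and falls with alcohol.
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
density <- 0.99 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005)

wine <- data.frame(residual.sugar = sugar, alcohol = alcohol, density = density)

# Pairwise Pearson correlations, rounded to two decimals for readability.
cor_matrix <- round(cor(wine), 2)
print(cor_matrix)
```

A coefficient near 0 only rules out a linear relationship; two variables can still be related in a non-linear way.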
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.000 9.200 9.500 9.809 10.300 13.600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.60 10.60 11.40 11.37 12.30 14.20
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
To illustrate how each chemical property differs over the quality scores, a graph with boxplots, jitter and a summary line of the mean for each quality score is used for each variable. Levels of alcohol show the most obvious pattern of increase with the increase in quality score. However, levels are relatively high for quality 3-4 and lower for quality 5, and then increase until the highest alcohol level is reached for quality 8-9.
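Per-group summaries in the format shown above can be produced with `tapply()`. This is a sketch on invented values, and the original report may have used a different function such as `by()`.

```r
# Toy data; the alcohol values and group assignments are invented.
wine <- data.frame(
  alcohol     = c(10.1, 9.5, 9.8, 10.5, 11.4, 12.0),
  quality_cat = factor(c("3-4", "5", "5", "6", "7", "8-9"),
                       levels = c("3-4", "5", "6", "7", "8-9"))
)

# Apply summary() to alcohol separately within each quality group.
alcohol_by_quality <- tapply(wine$alcohol, wine$quality_cat, summary)
print(alcohol_by_quality)
```

The same call with another column in place of `alcohol` gives the summaries for the other chemical properties.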
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.830 3.060 3.160 3.183 3.285 3.720
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.790 3.080 3.160 3.169 3.240 3.790
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.214 3.320 3.820
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.940 3.127 3.230 3.221 3.330 3.590
Levels of pH slightly increase the higher the quality score. There is quite a lot of spread in pH within each quality score, but it appears that slightly less acidic wines have a higher quality score.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.250 0.380 0.470 0.476 0.540 0.870
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2700 0.4200 0.4700 0.4822 0.5300 0.8800
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4800 0.4911 0.5500 1.0600
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4800 0.5031 0.5800 1.0800
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2500 0.3800 0.4600 0.4857 0.5825 0.9500
sulphates levels appear to decrease for quality 8-9, where the spread also appears to be the greatest.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.00 18.00 26.63 33.50 289.00
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 22.00 35.00 36.43 50.00 131.00
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 24.00 34.00 35.65 46.00 112.00
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 25.00 33.00 34.13 41.00 108.00
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 28.00 34.50 36.63 44.25 105.00
free.sulfur.dioxide over 50 ppm (mg/dm^3) becomes evident in the nose and taste of wine. Levels of it are lower for quality 3-4, but seem to be quite similar over the other quality scores.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9960 1.0004
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9933 0.9953 0.9953 0.9972 1.0024
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9906 0.9918 0.9925 0.9937 1.0004
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9903 0.9916 0.9922 0.9935 1.0006
density has very little spread within a quality score and it appears that higher quality wines have a slightly lower density.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100
Levels of chlorides appear to decrease the higher the quality score, which should mean that higher quality wines are less salty in taste.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.110 0.260 0.320 0.376 0.460 1.100
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.240 0.280 0.302 0.340 0.905
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2628 0.3200 0.7600
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.120 0.200 0.260 0.278 0.330 0.660
volatile.acidity appears to be highest for quality scores 3-4 and to decrease over higher qualities, except for the highest quality 8-9 where it is very slightly increased. Too high levels of volatile acidity could give the wine an unpleasant vinegar taste, so it makes sense that the lowest scored wines have the highest content of volatile acidity. Quality 8-9 has a higher min and lower max value than the rest and seems to be more concentrated around the mean. This might indicate that there needs to be a certain amount of volatile acidity in a wine for it to be ranked high, but that it also can’t be too high because of the vinegar taste it can result in. This upper limit might be 0.66 g/l, as it’s the max value for wines with quality score 8-9.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 85.5 119.0 130.2 177.0 440.0
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 121.0 151.0 150.9 182.0 344.0
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.2 132.0 137.0 164.0 294.0
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.1 144.2 229.0
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.0 102.8 122.0 125.9 148.5 212.5
Levels of total.sulfur.dioxide are increased for quality 5 and then decrease over the higher quality scores. Quality 8-9 has the highest min value and lowest max value, so the total.sulfur.dioxide levels seem to be more concentrated around the mean.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.400 6.900 7.181 7.650 11.800
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.500 6.400 6.800 6.934 7.400 10.300
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.200 6.700 6.735 7.200 9.200
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.800 6.678 7.300 9.100
fixed.acidity is very slightly higher for quality 3-4.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.350 2.700 4.821 7.500 17.550
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.800 7.000 7.335 11.500 23.500
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.700 3.650 5.186 7.325 19.250
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 2.075 4.300 5.628 8.150 14.800
residual.sugar is higher for quality score 5 than for the others. Residual sugar levels of 45 g/dm^3 or higher make a wine sweet, but in this data set only one wine has a level higher than that. All other wines are less sweet, with no great difference across quality scores.
## $`3-4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2050 0.3000 0.3077 0.4000 0.8800
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3377 0.4100 1.0000
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3256 0.3600 0.7400
##
## $`8-9`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.2800 0.3200 0.3282 0.3600 0.7400
Levels of citric.acid are very similar over the different quality scores. Small quantities of citric acid add freshness and flavor to the wine, but the levels in these wines do not seem to have affected the difference in quality scores very much.
alcohol has the strongest linear relationship with quality (positive relationship). These 4 scatter plots show the relationship between alcohol and density, residual.sugar, total.sulfur.dioxide and chlorides respectively, where an increase in the percent of alcohol shows a decrease in each of the other variables. These are the variables with the strongest correlation to alcohol.
density is the second variable that has a linear relationship with quality (negative relationship). The graphs above illustrate the strongest relationships between density and other variables.
The data set description explains that density is affected by the percent alcohol and sugar content in the wine. The first scatter plot shows the relationship between residual.sugar and density, where an increase in the amount of residual sugar shows an increase in the density of the wine. The third scatter plot shows that with an increase in the alcohol percent there is a decrease in the density of the wine. Lastly, the second graph shows that density also has a linear relationship with total.sulfur.dioxide, where density increases when total.sulfur.dioxide increases.
The correlation matrix showed that quality only had a linear relationship with two other variables. It has a weak positive relationship with alcohol (0.44) and a very weak negative relationship with density (-0.31). This means that with the increase in quality score, the percent alcohol in a wine also increases meanwhile the density of it decreases.
Low quality wines with the score 3-4 have low levels of free.sulfur.dioxide, but the levels do not increase over the other quality scores. It may just be that there must be a certain minimum level of free sulfur dioxide in a wine for it to be rated higher. Higher quality wines have lower levels of chlorides, which might indicate that they are less salty in taste. Higher quality wines also have lower levels of volatile.acidity.
pH and fixed.acidity have a weak negative relationship (-0.43): as the pH increases (gets more alkaline) the fixed.acidity decreases, which makes a lot of sense. fixed.acidity and volatile.acidity are not linearly related (-0.02), contrary to what I would have assumed.
I specifically looked at the variables alcohol and density and how these are related to other variables because these were the two variables with the strongest relationship to quality. Alcohol is only negatively related to other variables, most strongly to density (-0.78), residual.sugar (-0.45), total.sulfur.dioxide (-0.45) and chlorides (-0.36). Density is positively related to residual.sugar (0.84) and total.sulfur.dioxide (0.53) and negatively related to alcohol (-0.78).
The strongest relationship found between two variables is the positive relationship between residual.sugar and density with the correlation coefficient of 0.84.
quality only had a linear relationship with two variables, alcohol (r = 0.44) and density (r = -0.31). Alcohol and density are also linearly related (r = -0.78) and the scatter plot above shows this relationship colored by quality to illustrate how they all relate. The graph nicely shows that the higher the alcohol percent and the lower the density of a wine, the higher the quality score.
These graphs illustrate all positive linear relationships we could find in the correlation matrix in the bivariate plots section, and here we have added quality as color to explore how the relationship between these variables relates to the quality of a wine. We can see a clear pattern for density, residual.sugar and quality, where the quality of a wine appears to be higher when density and residual sugar are lower. Quality also seems to be increasing for wines with lower density and total.sulfur.dioxide, even though the relationship between these three variables is a bit less obvious. The other graphs show some regions with higher concentrations of good and bad wines but not very clearly.
Here we see graphs of all negative relationships we found in the correlation matrix and how these relationships relate to the quality of a wine. These relationships do not explain the effects on quality as clearly. A wine with higher alcohol percent appears to have a higher quality score in all graphs, but the spread for the second variable is much greater in terms of quality and this relationship does not seem to explain the quality of a wine very well. There is some indication that lower residual.sugar combined with higher alcohol, and slightly lower chlorides combined with higher alcohol, lead to an increase in the quality score. The relationship between fixed.acidity and pH does not seem to explain the quality of a wine much at all.
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
The table shows a summary of the results of a model for predicting the quality of a white wine using all other variables as predictor variables. The p-value of the F-statistic is < 2.2e-16, which means the model as a whole is highly significant and that at least one of the predictor variables is significantly related to the outcome variable.
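A model of this kind is fitted with `lm()`, and the overall p-value can be recovered from the F-statistic stored in the summary. The sketch below runs on simulated data with two hypothetical predictors, so the numbers it prints are not those of the wine model.

```r
set.seed(1)  # reproducible toy data
n <- 200

# Simulated stand-ins for two predictors and the outcome.
wine <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 0.6)
)
wine$quality <- 3 + 0.3 * wine$alcohol - 2 * wine$volatile.acidity +
  rnorm(n, sd = 0.7)

# Fit a multiple linear regression and print the usual summary table.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)
print(summary(fit))

# The F-statistic is stored with its two degrees of freedom.
f <- summary(fit)$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(p_value)
```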
We can also see statistics for each variable and its relationship to quality. The p-value (Pr(>|t|)) indicates whether a variable has statistically significant predictive capability. It is the probability of observing a t value at least this extreme if the true coefficient were zero, so a small value means the estimate is unlikely to be the result of random variation alone. A p-value of less than 5% indicates significance. The variables citric.acid, chlorides and total.sulfur.dioxide are not significant and can be removed from the model.
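As a worked check, the t value and Pr(>|t|) for a coefficient can be reproduced from its estimate, its standard error and the residual degrees of freedom. The sketch below uses the fixed.acidity row of the first model summary; because the inputs are the rounded values printed in the table, the result agrees only approximately.

```r
# Values copied from the fixed.acidity row of the first model summary.
estimate  <- 6.552e-02
std_error <- 2.087e-02
df_resid  <- 4886  # residual degrees of freedom reported by the model

# The t value is the estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value: probability of a |t| at least this large
# under the null hypothesis that the true coefficient is zero.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(t_value)
print(p_value)
```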
##
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar +
## free.sulfur.dioxide + density + pH + sulphates + alcohol,
## data = white_wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8246 -0.4938 -0.0396 0.4660 3.1208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.541e+02 1.810e+01 8.514 < 2e-16 ***
## fixed.acidity 6.810e-02 2.043e-02 3.333 0.000864 ***
## volatile.acidity -1.888e+00 1.095e-01 -17.242 < 2e-16 ***
## residual.sugar 8.285e-02 7.287e-03 11.370 < 2e-16 ***
## free.sulfur.dioxide 3.349e-03 6.766e-04 4.950 7.67e-07 ***
## density -1.543e+02 1.834e+01 -8.411 < 2e-16 ***
## pH 6.942e-01 1.034e-01 6.717 2.07e-11 ***
## sulphates 6.285e-01 9.997e-02 6.287 3.52e-10 ***
## alcohol 1.932e-01 2.408e-02 8.021 1.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7512 on 4889 degrees of freedom
## Multiple R-squared: 0.2818, Adjusted R-squared: 0.2806
## F-statistic: 239.7 on 8 and 4889 DF, p-value: < 2.2e-16
To see how accurate the model is we can look at the R-squared value. An R-squared value close to 1 indicates that the model explains a large portion of the variance in the outcome variable. However, R-squared never decreases when more variables are added to the model, even if those variables are only weakly associated with the outcome. We should look at adjusted R-squared instead, which takes into account the number of predictor variables in the model.
The adjusted R-squared value for the model is 0.2806, practically the same result as for the previous model with all variables included (0.2803). This means that 28% of the variance in the measure of quality can be predicted by the variables included, which is very low and doesn’t make the model accurate enough. At least removing some predictor variables made the model just as strong, so we can reach the same accuracy with fewer measurements per wine.
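The adjusted values can be reproduced from the reported multiple R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The helper name `adjusted_r2` below is made up for this sketch.

```r
# Standard adjusted R-squared formula; the function name is illustrative.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 4898  # observations in the data set

# Full model: 11 predictors, multiple R-squared 0.2819.
adj_full <- adjusted_r2(0.2819, n, p = 11)

# Reduced model: 8 predictors, multiple R-squared 0.2818.
adj_reduced <- adjusted_r2(0.2818, n, p = 8)

print(round(c(full = adj_full, reduced = adj_reduced), 4))
```

The penalty term is small here because n is much larger than p, which is why both adjusted values sit so close to the unadjusted ones.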
I looked at alcohol and density and how the relationship between these variables could explain the quality score of a wine. These were the only two variables that had a direct linear relationship with quality and, as expected, they quite nicely explain the quality of a wine - the higher the alcohol percent and the lower the density of a wine, the higher the quality score. The graph illustrating the relationship between density, residual sugar and quality also nicely shows how the quality score increases as both density and residual sugar in a wine get lower.
I think the most interesting thing I discovered is that with so many descriptive variables of wine at hand, only 28% of the variance in the measure of quality can be predicted by the variables included in the multiple linear regression model I created. In terms of the multivariate graphs plotted, it is surprising to see that so few relationships between variables seem to explain the quality score of a wine.
I created a multiple linear regression model for predicting the quality of a wine by including all variables quantifying the chemical properties of the wine, to understand which variables are significant for predicting the quality score. I refined this model by removing variables that did not have significant predictive capability for quality. This made the model just as strong as before in predicting quality, with an adjusted R-squared value very similar to that of the first model, but with fewer measurements per wine needed as input.
As mentioned in the section above, the model is not very strong in predicting the quality of a wine even though there are so many variables on the properties of a wine available. This might be because the bulk of wines in the data set are of average quality. There are no poor wines with a quality score of 1 or excellent wines with a quality score of 10 available, so the model will not be able to predict these scores very well. The model might also be weak in predicting quality because there are non-linear relationships between quality and other variables, which this model cannot take into account very well. Finally, the model cannot be expected to predict the quality of white wines in general, since only wines of the kind Vinho Verde are included in the data set and the model can only represent this kind of wine.
This graph shows the strongest relationship that quality has to another variable, alcohol. The boxplots show the quartile boundaries and the median values, and the many dots show the number of observations and how they are distributed within each quality score. Most wines have a quality score of 5 or 6 and few have a score of 3-4 or 8-9. The graph shows how large an impact alcohol has on quality, and the general trend is clearly illustrated by the orange line representing the median alcohol percent per quality score: higher quality wines have a higher alcohol percent.
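A plot of this kind can be approximated with the sketch below. It uses base R graphics and synthetic data, so the values, the uniform spread of quality scores and the plotting functions are all assumptions for illustration and not the code behind the original figure.

```r
# Synthetic stand-in data (hypothetical values, NOT the real data set).
set.seed(1)
wines <- data.frame(quality = sample(3:9, 300, replace = TRUE))
wines$alcohol <- 8 + 0.4 * wines$quality + rnorm(300, sd = 1)

# Median alcohol percent for each quality score.
median_alcohol <- tapply(wines$alcohol, wines$quality, median)

# One boxplot per quality score, with the individual observations as jittered dots.
boxplot(alcohol ~ quality, data = wines, outline = FALSE,
        xlab = "Quality score", ylab = "Alcohol (%)")
stripchart(alcohol ~ quality, data = wines, vertical = TRUE,
           method = "jitter", pch = 16, cex = 0.4, add = TRUE)

# Orange line connecting the per-score medians to show the overall trend.
lines(seq_along(median_alcohol), median_alcohol, col = "orange", lwd = 2)
```

The `median_alcohol` vector holds the values traced by the orange line, so it can also be printed as a small summary table alongside the plot.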
Quality only has a linear relationship with two variables, alcohol and density, which are also linearly related to each other, and this scatter plot shows how all three variables relate. The relationship between alcohol and density explains the quality score of a wine very well, and this graph illustrates it through the negative slope of the scatter plot and the color of the dots. The higher the alcohol percent and the lower the density of a wine, the higher the quality score will be.
This last graph illustrates the strongest relationship between any two variables in the data set, density and residual sugar. Quality has been added to illustrate how this relationship can explain the quality score of a wine. The positive slope of the scatter plot and the color of the dots clearly visualize how the quality score of a wine increases as the density and the residual sugar of a wine get lower.
I think the data set has been easy to work with. It is a tidy data set which has already been cleaned and structured and has no missing data, so I could start with the exploration straight away. It was fun to use real data for this project and I have learned a lot about the chemical properties of a wine. The text description of the data set was also very helpful in deciding which variables would be useful to explore further, since it explained how a chemical property could affect the taste or smell of a wine. There was no information on the amount of tannin in each wine, which I mentioned further up affects the taste of a wine. Tannins add balance and structure to a wine, and it would have been interesting to see how much this chemical property affects the quality score of a wine.
Alcohol content seems to be the number one chemical property that influences the quality of a wine, followed by its density and residual sugar content. A higher alcohol percent, which also means a lower density and less residual sugar (less sweet), appears to be characteristic of a higher quality wine. It also seems that higher quality wines have lower amounts of chlorides, which could mean they are slightly less salty. It was surprising to see that there was no linear relationship between fixed and volatile acidity, which I would have assumed beforehand.
The data set had enough observations in terms of quantity for the analysis, but most of the wines were of average quality and there are no really bad or really excellent wines present, which makes it very hard to predict the quality of a wine with the data at hand. More observations in the lower and upper ranges of the quality score would be useful for creating models that predict which chemical properties influence the quality of a wine. It was also a bit challenging to see the trends in the data by only looking at the graphs, because there could be small changes in the data that are hard to see. The correlation matrix and summary tables really helped with understanding this.
I also think that it would be even more interesting to use a data set that represents different kinds of white wine. This data set only includes observations of the wine Vinho Verde, so the analysis only represents this kind of wine and not white wine in general. In the future it would be interesting to look into how these chemical properties affect the quality of white wine in a broader sense.